Abstract. In the Relevance Feedback (RF) task the user is directly
involved in the search process: given an initial set of results, the user
specifies whether or not they are relevant to his or her information goal.
In  the  TREC  2009  RF  track  the  first  five  documents  retrieved  by  the 
baseline  systems  were  judged  by  the  assessors  and  then  used  as  evidence 
for  the  RF  algorithms  to  be  tested.  The  specific  algorithm  we  tested  is 
mainly  based  on  a  geometric  framework  which  allows  the  latent  semantic 
associations  of  terms  in  the  feedback  documents  to  be  modeled  as  a 
vector  subspace;  the  documents  of  the  collection  represented  as  vectors 
of  TF-IDF  weights  were  re-ranked  according  to  their  distance  from  the 
subspace.  The  adopted  geometric  framework  was  used  in  past  works  as 
a  basis  for  Implicit  Relevance  Feedback  (IRF)  and  Pseudo  Relevance 
Feedback (PRF) algorithms; the participation in the RF track allows
us  to  make  some  preliminary  investigations  on  the  effectiveness  of  the 
adopted  framework  when  it  is  exploited  to  support  explicit  RF  on  much 
larger  test  collections,  thus  complementing  the  work  carried  out  for  the 
other  RF  strategies. 


1  Introduction 

In TREC 2009 the Information Management System (IMS) Research Group of
the University of Padua (UNIPD) participated in the RF Track. The track was
structured in two phases, namely Phase 1 and Phase 2. The purpose of Phase
1 was to evaluate the systems' capability of retrieving good documents to be
judged, that is, documents which would be good input for the RF algorithms to be
tested. The aim of Phase 2 was to evaluate the improvement provided by the
RF algorithms when different sets of judged documents were used as input. We
submitted results to both Phase 1 and Phase 2.

The  specific  RF  algorithm  we  evaluated  is  based  on  the  geometric  framework 
proposed  in  [1],  which  allows  different  sources  for  feedback  to  be  modeled  as 
vector  subspaces  and  their  models  to  be  exploited  to  predict  relevance.  In  the 
previous  works  the  framework  was  applied  to  two  different  sources. 

The first source was the behavior of the user, described in terms of
interaction features gathered by monitoring the interaction between the user and
the Information Retrieval (IR) system [2]. The user behavior, modeled as a vector
subspace, was used to re-rank the documents: the most frequent keywords were
extracted from the top n re-ranked documents and adopted for expanding the
textual description of the topic, which was then considered as a new, expanded
query. That approach falls into the class of IRF algorithms, since interaction
features can be gathered without a direct involvement of the user, and their
combination was used as an implicit indicator of the user's intents or
interests.

Report Documentation Page (Standard Form 298, Rev. 8-98)
Report date: NOV 2009
Title and subtitle: University of Padua at TREC 2009: Relevance Feedback Track
Performing organization: University of Padua, Department of Information
Engineering, Padua, Italy
Distribution/availability statement: Approved for public release; distribution
unlimited
Supplementary notes: Proceedings of the Eighteenth Text REtrieval Conference
(TREC 2009) held in Gaithersburg, Maryland, November 17-20, 2009. The conference
was co-sponsored by the National Institute of Standards and Technology (NIST),
the Defense Advanced Research Projects Agency (DARPA) and the Advanced Research
and Development Activity (ARDA).
Security classification (report, abstract, this page): unclassified
Number of pages: 10

The second source for feedback used was the “latent semantics” [3] of the
terms appearing in the top n retrieved documents [1]; the top k weighted
keywords in these documents were adopted to extract the most “meaningful” term
groups, as in Latent Semantic Analysis (LSA). In practice the adopted approach
provided a vector subspace representation of the term groups; the top m
retrieved documents were re-ranked according to the distance between their
vector representation in terms of the top weighted keywords and the computed
subspace.

Therefore the effectiveness of the adopted geometric framework was tested
with regard to IRF [2] and PRF [1], respectively. The purpose of the work carried
out in the RF Track was to test the effectiveness of that framework with regard
to explicit Relevance Feedback (RF) by using a test collection two orders of
magnitude larger than those used in the previous experiments. In particular, the
source for feedback used was the content of the top two documents judged as
relevant by the assessors among the top five documents retrieved. The approach
proposed in [1] was applied to the content of these documents in order to
re-rank the top 2500 results retrieved by the baseline. The baseline adopted in
Phase 1 exploited the BM25 weighting scheme [4] to provide an initial ranked
list of results. Then the top ten retrieved documents were re-ranked according
to the presence of the topic keywords in their URLs.

The remainder of this paper is structured as follows. Section 2 briefly
explains the methodology for RF adopted in this work and the role of the adopted
geometric framework in such methodology. Section 3 focuses on the experiments
carried out during the participation in the RF Track and describes the
setting adopted for indexing and retrieval in both Phase 1 and Phase 2. The
results obtained by the experiments described in Section 3 are reported and
discussed in Section 4. Finally, Section 5 reports some concluding remarks.


2  Methodology 

The specific methodology for relevance feedback we tested in the RF Track of
TREC 2009 is that proposed in [5]. The methodology consists of four
steps: (i) selection of the source of feedback, (ii) selection and collection of
the features, (iii) source modeling and (iv) relevance prediction.

As regards the first step, the source for feedback selected is the latent
semantic structure in the content of the documents used as evidence. Differently
from [1], where the content of the top n retrieved documents was used as source
of evidence, in this work the source adopted is the content of a subset T of
the documents judged as relevant among the top n retrieved by a first retrieval
run; in the RF Track the initial run is the Phase 1 run.

The main assumption underlying this work is that some terms appearing in
the documents in T can be used to predict what the terms used by the searchers
really imply. In other words the terms appearing in the considered subset of the
feedback documents are the features selected to model the considered source for
feedback. The specific information adopted is the co-occurrence of terms
appearing near each other: windows of text centered around the terms can be used
to capture “local co-occurrence” information. Suppose that the terms “music”,
“restaurant”, “rock” and “jazz” are selected as features. If in the documents in
T the term “jazz” tends to occur more frequently near “music” and/or
“restaurant”, the searcher may be more interested in restaurants where jazz
music is played than in those with live rock music.

This local co-occurrence information can be extracted and prepared in a
matrix as follows. Let $T$ be the set of $k$ features, namely terms, selected to
describe the source and let $S \in \mathbb{R}^{k \times k}$ be a matrix whose
elements are initially set to zero, namely $s_{ij} = 0$ for $1 \le i, j \le k$.
For each term $t_i \in T$ a window of text centered around each occurrence of
$t_i$ is considered; if a term $t_j \in T$, with $j \ne i$, appears in the
window of text, statistical information about $t_j$, e.g. its total frequency
in the collection, or a weight derived from such information, e.g. the TF-IDF,
is added both to $s_{ij}$ and $s_{ji}$.
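
The construction can be illustrated with a minimal Python sketch. It assumes that the documents are already tokenized, that the window is expressed as a number of tokens on each side of the center term (the hypothetical parameter `half_window`; a value of 5 gives a window of 11 tokens), and that a weight per feature is available. The function name and its arguments are illustrative and are not taken from the system used in the experiments.

```python
import numpy as np

def cooccurrence_matrix(docs, features, weights, half_window=5):
    """Build the symmetric k x k co-occurrence matrix S from windows of text.

    docs        : list of token lists, one per feedback document
    features    : list of the k selected terms
    weights     : dict mapping a term to its weight (e.g. TF-IDF), assumed given
    half_window : tokens considered on each side of the center term (assumption)
    """
    index = {t: i for i, t in enumerate(features)}
    S = np.zeros((len(features), len(features)))
    for tokens in docs:  # windows are built per document, never across documents
        for pos, center in enumerate(tokens):
            if center not in index:
                continue
            lo = max(0, pos - half_window)
            hi = min(len(tokens), pos + half_window + 1)
            for other in tokens[lo:hi]:
                if other in index and other != center:
                    i, j = index[center], index[other]
                    # the weight of the co-occurring term is added to both entries
                    S[i, j] += weights[other]
                    S[j, i] += weights[other]
    return S
```

Note that in this sketch a pair of nearby features contributes once for each of the two terms taken as window center, which follows from iterating over every occurrence of every feature.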

The main question is how to obtain a usable representation of the source for
feedback adopted in order to assist the prediction of the documents' relevance.
A possible solution is that proposed in [1], where the mathematical construct of
the vector subspace is adopted to model sources. The main issue is how to obtain
a vector subspace representation starting from the information collected by the
observation of the selected features. A possible solution, which is the approach
actually adopted in this work, is to apply Singular Value Decomposition (SVD)
to S and select the first principal eigenvector.
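
As an illustration only, this step can be sketched with NumPy, which is used here in place of the Java matrix package adopted in the experiments. The sketch assumes that "first principal eigenvector" means the singular vector associated with the largest singular value; for a symmetric matrix such as S, that vector is also an eigenvector of S, defined up to sign. The function name is hypothetical.

```python
import numpy as np

def first_singular_vector(S):
    """Return the unit-norm left singular vector for the largest singular value."""
    # numpy returns the singular values sorted in descending order,
    # so the first column of U is the leading direction
    U, sigma, Vt = np.linalg.svd(np.asarray(S, dtype=float))
    return U[:, 0]
```

Since the sign of the vector is arbitrary, any function built on it should depend only on the spanned subspace, as the projector-based measure does.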

This vector spans a subspace which can be used to re-rank the documents in
the collection, that is to implement the relevance prediction step of the
methodology. This goal can be achieved by the adoption of a trace-based
function; the idea of using trace-based functions in IR was originally proposed
in [6] and subsequently developed in [1]. Let us denote with $b$ the first
principal eigenvector among those provided by SVD and denote with $L(\{b\})$ the
subspace spanned by $b$. We are interested in measuring the degree to which the
latent semantic structure modeled as subspace is present in the documents of the
collection, and in ranking the documents according to this measure. The
mentioned function measures the distance between the vector representation of
the document $y$ and the subspace $L(\{b\})$, that is the projection $y_{\{b\}}$
of $y$ onto $L(\{b\})$. More formally, the function adopted is the following:

$$m_{\{b\}}(y) = y^{T} \cdot P_{\{b\}} \cdot y, \qquad (1)$$

where $P_{\{b\}} = b \cdot b^{T}$ is the projector onto the subspace $L(\{b\})$.

The measure provided by Equation 1 is a probability measure, as shown
in [1], that is $m_{\{b\}}(y) = \Pr[L(\{b\}) \mid L(\{y\})]$, where $L(\{y\})$
denotes the subspace spanned by $y$.
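
A small numerical sketch of Equation 1 may help. It assumes that both the document vector and the basis vector are non-zero and normalizes them to unit length, so that the value lies between 0 and 1; the function name is illustrative. For a one-dimensional subspace the measure reduces to the squared inner product between the two unit vectors.

```python
import numpy as np

def subspace_score(y, b):
    """Compute y^T P y with P = b b^T, for unit-normalized y and b."""
    y = np.asarray(y, dtype=float)
    b = np.asarray(b, dtype=float)
    y = y / np.linalg.norm(y)  # assumes y is not the zero vector
    b = b / np.linalg.norm(b)  # assumes b is not the zero vector
    P = np.outer(b, b)         # projector onto the subspace spanned by b
    return float(y @ P @ y)    # equals (b . y) ** 2
```

A document lying in the subspace scores 1, a document orthogonal to it scores 0, and re-ranking sorts the documents by decreasing score.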


3 Experiments

The IR system adopted in the experiments exploits the functionalities provided
by Apache Lucene [7] for indexing and retrieval¹. The specific choices made in
regard to parsing, indexing and retrieval are described in the remainder of this
section. The experiments for both Phase 1 and Phase 2 were carried out on the
TREC 2009 “Category B” dataset, constituted by 50,220,423 English web pages.

The experiments were carried out on a cluster of twenty-eight 3 GHz Intel
Quad Core® E5450, which is available in our department.

3.1  Parsing  and  Indexing 

Each web page of the TREC 2009 “Category B” dataset was parsed; in particular,
the following information was extracted from each record in Web ARChive
(WARC) format: the TREC-ID, the URI and the content. Each of them was
stored in a distinct Field of a Lucene Document². All the content of the
document was processed during indexing except for the text contained inside the
<script></script> and the <style></style> tags. Moreover an additional
field was stored, which contained the keywords extracted from the URL of the
document. In particular, during the extraction of the terms from the full content
of the documents the presence of each term was checked in the URL; the
obtained keywords were then indexed in a separate field, which was used in Phase
1 to re-rank the top ten retrieved documents.

Stop words were removed during indexing³. No stemming was adopted. During
indexing not only statistical information about the occurrence of the terms in
the documents, namely their frequency, was stored, but also information about
the positions where terms occurred and offset information⁴. The information
about the position of the terms was used to implement the methodology
described in Section 2 and exploited for Phase 2 as described in Section 3.3.

The wall-clock time to index the 1492 records of the TREC 2009 “Category
B” dataset was 45 hours, 46 minutes and 45 seconds, while the CPU time was 38
hours, 3 minutes and 39 seconds (36:29:08 user time and 01:34:30 system time).

3.2  Retrieval:  Phase  1 

The purpose of Phase 1 of the RF Track was to retrieve good documents to be
judged, that is, the documents used as input for feedback in Phase 2.

1 The specific version adopted in the experiments was Apache Lucene 2.4.1.

2 “A Document represents a collection of fields [...] Each field corresponds to a
piece of data that is either queried against or retrieved from the index during
search” [8].

3 The stop words list is that available at
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words

4 In Lucene information about the unique terms in a field, their counts, their
positions and their offsets can be stored at indexing time and then accessed by
using TermVectors. The specific TermVector option chosen for the Lucene Field
used for the “content” was TermVector.WITH_POSITIONS_OFFSETS.

Each of the fifty topics was automatically parsed, thus extracting its
constituting terms; no stemming was adopted on the obtained terms. For each term
$q_i$ in a topic we constructed a Lucene TermQuery for the content field, that is
a query to retrieve all the documents where the term $q_i$ appears in their
content field. The TermQuery objects constructed for the terms $q_i$ in a topic
were combined in a Lucene BooleanQuery: each TermQuery was considered as an
optional clause, that is, the TermQuery objects were combined by logical OR⁵.

The weighting scheme adopted was the BM25, particularly exploiting the
implementation for Lucene made available in [9] which is based on the description
of the BM25 presented in [10] and briefly described in the following. Let $V_D$ be
the set of terms appearing in document $D$; the weight $w_i$ assigned to the term
$t_i \in V_D$ is

$$w_i = \frac{tf'_i}{k_1 + tf'_i} \cdot \log \frac{N - n_i + 0.5}{n_i + 0.5}$$

where $N$ is the total number of documents in the collection, $n_i$ is the number
of documents in the collection where the term $t_i$ appears, and $k_1$ is a
parameter which was heuristically set to $k_1 = 2$ in the experiments. The
quantity $tf'_i$ is defined as $tf'_i = tf_i / B$, where $tf_i$ is the term
frequency of $t_i$, and

$$B = (1 - b) + b \cdot \frac{dl}{avdl}$$

where $dl = \sum_{t \in V_D} tf_t$ is the document length, and $avdl$ is the
average document length in the collection. The value of $b$ adopted in the
experiments was $b = 0.75$.
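
The weight can be sketched in a few lines of Python. The sketch assumes the saturating form of BM25 in which the length-normalized term frequency tf' = tf / B is divided by k1 + tf' and multiplied by the logarithmic collection factor; this form is an assumption made for illustration, as are the function and parameter names. The defaults k1 = 2 and b = 0.75 are the values reported for the experiments.

```python
import math

def bm25_weight(tf, dl, avdl, n_docs, doc_freq, k1=2.0, b=0.75):
    """BM25 weight of one term in one document (illustrative sketch).

    tf       : frequency of the term in the document
    dl, avdl : document length and average document length
    n_docs   : number of documents in the collection (N)
    doc_freq : number of documents containing the term (n_i)
    """
    B = (1.0 - b) + b * (dl / avdl)  # document length normalization
    tf_norm = tf / B                 # tf' = tf / B
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return tf_norm / (k1 + tf_norm) * idf
```

With this form the contribution of a term saturates as its frequency grows, and the logarithmic factor becomes negative for terms occurring in more than half of the documents.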

The top ten documents retrieved by BM25 were re-ranked according to the
number of the topic keywords among those extracted from the URL field. If two
documents had the same BM25 score and the same number of topic keywords in
the URL field, the documents were ranked according to the lexicographical order
of their identifiers. The top five re-ranked documents were provided as results
for Phase 1.
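
One possible reading of this re-ranking rule is sketched below in Python. It assumes that the number of topic keywords found in the URL field is the primary sort key, that the BM25 score breaks ties on that count, and that the identifier breaks the remaining ties; the secondary role of the BM25 score is an interpretation, and all names are hypothetical.

```python
def rerank_top_by_url(ranked, query_terms, top=10, keep=5):
    """Re-rank the first `top` results by URL keyword matches and keep `keep`.

    ranked      : list of (doc_id, bm25_score, url_keywords), best BM25 first
    query_terms : terms extracted from the topic
    """
    query = set(query_terms)

    def sort_key(item):
        doc_id, score, url_keywords = item
        matches = len(query & set(url_keywords))
        # more URL matches first, then higher BM25 score, then identifier order
        return (-matches, -score, doc_id)

    head = sorted(ranked[:top], key=sort_key)
    return [doc_id for doc_id, _, _ in head[:keep]]
```

Documents ranked below position `top` by BM25 are never promoted, so the five results returned are always drawn from the initial top ten.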


3.3  Retrieval:  Phase  2 

Phase 2 aimed at investigating the effectiveness of the RF algorithms when
different Phase 1 runs were used as source for feedback. In other words the
objective was to test the effectiveness of the algorithms with regard to
different baseline systems and the documents they provided. Seven sets of judged
documents were assigned to UNIPD, particularly those provided by the Phase 1
runs CMU.1, hit2.1, ilps.2, PRIS.1, QUT.1, UMas.1 and UPD.1.

The specific algorithm we tested in Phase 2 was that described in Section 2,
particularly using a subset of the relevant documents among the top five as
evidence to extract the latent semantics of the terms. The methodology is
summarized in the following steps:

5 The specific boolean operator adopted for the Lucene BooleanQuery was
BooleanClause.Occur.SHOULD

1. Selection of the top $h$ relevant documents among the top five retrieved for
the specific topic and the particular Phase 1 run considered. If the number
of documents judged as relevant among the top five retrieved is greater than
one, then the top two relevant documents are selected, that is $h = 2$. If
only one document is judged as relevant, that document is selected, that is
$h = 1$. If there are no relevant documents among the top five, the baseline
ranked list is returned as result for Phase 2, specifically the top $m = 2500$
documents.

2. Selection of the set $T$ of the top $k = 5$ weighted terms in the selected
relevant documents; the weight of the keywords is computed by TF-IDF.

3. Computation of the co-occurrence matrix $S$ by windows of text; only the
full text of the selected relevant documents is used. In particular a window
of text of size 11 is centered around each occurrence of a keyword $t_i \in T$.
If a keyword $t_j \in T$ appears in the window of text centered around $t_i$, the
TF-IDF weight of $t_j$ is added to the elements $s_{ij}$ and $s_{ji}$ of $S$. The
window of text never overlaps two distinct documents.

4. Decomposition of the co-occurrence matrix $S$ by SVD and adoption of the
vector subspace $L(R_p)$ spanned by the first eigenvector $b$ as model of the
selected source⁶.

5. Re-ranking of the top $m = 2500$ results retrieved by the baseline according
to the distance between the vector representation of the documents and the
computed subspace; the specific function adopted is Eq. 1, that is
$y^{T} \cdot P_{\{b\}} \cdot y$, where $y$ is the document vector normalized so
that $\|y\| = 1$, $P_{\{b\}} = b \cdot b^{T}$, and $b$ is the eigenvector
computed in the previous step.
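
The control flow of steps 1, 2 and 5 can be sketched as follows. This is a minimal Python illustration under stated assumptions: relevance judgments are given as a dictionary, the TF-IDF weights of the candidate terms and the subspace scores of the documents are computed elsewhere, and ties among term weights are broken alphabetically. All function and parameter names are hypothetical.

```python
def select_feedback_docs(top_five, judgments, max_docs=2):
    """Step 1: keep at most two relevant documents, in rank order."""
    relevant = [d for d in top_five if judgments.get(d, False)]
    return relevant[:max_docs]  # an empty list means: fall back to the baseline

def top_k_terms(tfidf, k=5):
    """Step 2: pick the k terms with the highest weight.

    tfidf maps each candidate term to its weight; the alphabetical
    tie-break is an assumption made for determinism.
    """
    ordered = sorted(tfidf.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:k]]

def phase2_ranking(baseline, scores, feedback_docs, m=2500):
    """Step 5: re-rank the top m baseline results by subspace score.

    baseline : document ids in baseline order
    scores   : dict doc_id -> value of Eq. 1, assumed computed elsewhere
    """
    head = baseline[:m]
    if not feedback_docs:  # no relevant document among the top five
        return head
    # sorted() is stable, so equal scores keep their baseline order
    return sorted(head, key=lambda d: -scores[d])
```

Keeping the fallback inside `phase2_ranking` makes explicit that topics without relevant feedback documents produce exactly the baseline list.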

The results submitted for Phase 2 were the re-ranked list of documents
obtained at step 5 or, as mentioned in step 1, the results provided by the
baseline if there were no documents judged as relevant among the top five
retrieved. The latter choice is due to the difference between the “subspace of
irrelevance” and the subspace spanned by non relevant documents. Indeed, as
stated in [1], if orthogonality is chosen to model mutual exclusion and $L(R_p)$
denotes the subspace of relevance, $L(R_p)^{\perp}$ may denote irrelevance. While
the subspace of irrelevance is orthogonal to $L(R_p)$, $L(\bar{R}_p)$ is in
general oblique, where $L(\bar{R}_p)$ denotes the subspace spanned by non
relevant documents. In other words, ranking according to
$1 - \Pr[L(R_p)^{\perp} \mid L(\{y\})]$ is in general different from ranking by
$1 - \Pr[L(\bar{R}_p) \mid L(\{y\})]$. If all the documents are judged by
searchers as non relevant, $L(\bar{R}_p)$ can be computed but not
$L(R_p)^{\perp}$. For this reason, if none of the top five retrieved documents
were judged as relevant, the baseline results were returned.

4  Results 

In Phase 1 the baseline we adopted was able to retrieve at least one relevant
document among the top five results for 37 of the 50 topics. Table 1 reports
the statAP [12] computed for the results returned by the baseline (B) and the
results returned by the adopted RF algorithm (RF) for the 49 topics⁷. Moreover
Table 1 reports the percentage difference between the baseline results and the
results provided by the RF algorithm. For eight of the thirty-seven topics the RF
algorithm was effective in terms of statAP (in Table 1 the results referring to
these topics are bolded), but in general the RF negatively affected the ranked
list provided by the baseline.

6 In the experiments the JAMA package [11] was used to implement all the
functionalities for constructing and manipulating matrices.


Topic  B (stAP)  RF (stAP)  Δ_RF-B (%)   Topic  B (stAP)  RF (stAP)  Δ_RF-B (%)
1      0.14444   0.00031    -99.78468    28     0.23796   0.07298    -69.33195
2      0.62849   0.12205    -80.58025    29     0.00220   -          0.00000
3      0.07932   0.10391    30.99605     30     0.18557   0.01058    -94.30020
4      0.01892   0.00100    -94.70874    31     0.15710   0.24758    57.59663
5      0.16549   0.00813    -95.08602    32     0.25684   0.10356    -59.67980
6      0.05699   -          0.00000      33     0.43541   0.32751    -24.78026
7      0.03044   0.00294    -90.35278    34     0.00781   0.00156    -79.98719
8      0.00247   -          0.00000      35     0.20071   0.19446    -3.11295
9      0.07686   0.01832    -76.16712    36     0.05640   0.12389    119.68437
10     0.28083   0.42035    49.68539     37     0.12500   0.00228    -98.17360
11     0.10908   0.02643    -75.76685    38     0.18799   -          0.00000
12     0.27450   0.20720    -24.51484    39     0.25447   0.05185    -79.62345
13     0.00560   -          0.00000      40     0.15270   0.11565    -24.26390
14     0.02117   -          0.00000      41     0.32043   0.18921    -40.95185
15     0.17731   0.22655    27.77088     42     0.00000   -          0.00000
16     0.17146   0.08028    -53.17796    43     0.14216   -          0.00000
17     0.06113   0.07385    20.80784     44     0.02379   0.00169    -92.90068
18     0.09633   0.05081    -47.25253    45     0.25197   0.03685    -85.37581
19     0.00000   -          0.00000      46     0.69705   0.33444    -52.02037
21     0.37863   0.14138    -62.65867    47     0.29992   0.20607    -31.29334
22     0.43105   0.07411    -82.80633    48     0.20898   0.02941    -85.92811
23     0.03668   -          0.00000      49     0.07552   0.10736    42.16290
24     0.13116   -          0.00000      50     0.11841   0.36787    210.67824
25     0.03912   0.01756    -55.12067
26     0.08003   0.03321    -58.51035
27     0.21324   -          0.00000
All    0.16549   0.10148    -38.68247

Table 1: Comparison between the statAP values computed for the results returned
by the baseline (B) and the results returned by the RF algorithm (RF). Δ_RF-B
denotes the statAP percentage difference between the baseline and the RF
algorithm. For the topics with no relevant documents among the top five
retrieved, only the baseline statAP is reported since Δ_RF-B = 0.
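
The percentage difference in the last column can be recomputed from the two statAP columns. The short Python sketch below assumes the usual relative difference with respect to the baseline, (RF - B) / B x 100, which is not spelled out in the text; the function name is illustrative. Values recomputed from the rounded five-decimal figures agree with the tabulated ones only up to rounding.

```python
def pct_diff(baseline, rf):
    """Relative statAP difference of RF with respect to the baseline, in percent."""
    if rf == baseline:
        # covers the topics for which the baseline list was returned unchanged
        return 0.0
    # raises ZeroDivisionError if the baseline statAP is 0 and RF differs
    return (rf - baseline) / baseline * 100.0
```

For instance, the values in the "All" row, 0.16549 and 0.10148, give a difference of about -38.68%.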


7 Topic 20 was dropped since none of the Phase 1 runs returned relevant documents
for such topic.

Run ID   statAP
CMU.1    0.08267
hit2.1   0.09385
ilps.2   0.09844
PRIS.1   0.17784
QUT.1    0.13487
UMas.1   0.09700
UPD.1    0.10148

Table 2: statAP computed over all the topics with regard to the seven Phase 1 sets
assigned to UNIPD.


Fig.  1:  statAP  reported  for  the  different  runs  based  on  the  assigned  Phase  1  sets  with 
regard  to  all  the  topics. 


Table 2 reports the statAP computed over all the topics with regard to the
Phase 1 runs assigned to UNIPD for Phase 2. The results show that PRIS.1
and QUT.1 were able to provide more effective evidence to perform RF than the
UNIPD Phase 1 run (UPD.1). But when the statAP values are considered with regard
to each topic (see Figure 1), there is no Phase 1 set which provides good
evidence for feedback to the tested RF algorithm for all the topics.

In regard to the effectiveness of UPD.1 as source for feedback, Table 3 reports
the number of runs for which the UNIPD Phase 1 set was worse (<) and better (>)
than the other Phase 1 runs with regard to all the groups to which UPD.1 was
assigned; the values reported are computed over all the topics.

5  Concluding  Remarks 

The results reported show that the RF algorithm tested is less effective than the
baseline in terms of statAP (0.10148 against 0.16549 over all the topics).

One of the reasons for these results may be the little evidence used for
feedback: only the content of the top two documents judged as relevant among the
top five retrieved was used. This suggests an investigation of the impact of the
adopted number of relevant documents on the effectiveness of the RF algorithm.

Measure  <   >
e-map    13  7
mapA     4   3
P10A     4   3
stAP     12  8

Table 3: Impact of Phase 1 UNIPD set on the other groups which used such run as
evidence for feedback.

The adoption of the AND operator instead of OR to construct the queries
from the topic keywords may improve the obtained results in terms of precision.
Moreover one issue to be investigated is the selection of the features to build
the model of the source. Indeed the features with the highest TF-IDF weights in
the feedback documents are not necessarily those most useful for feedback: in
several cases we observed that some of the features selected were not related to
the topic. The approach adopted to model the content of the feedback documents
as a vector subspace seems to help when wrong features are selected.
Indeed the weights assigned by the first eigenvector to those features are lower
(often near zero) than those assigned to features related to the topic. Moreover
the query was not expanded, that is the new query did not necessarily include
the topic terms, but only the selected features: this choice might have hurt the
effectiveness of the algorithm. As a consequence the way a better selection of
the features affects the tested RF algorithm will be a matter of investigation.